Binary Classification Model for Parkinson's Disease Using Python Take 2

David Lowe

April 26, 2019

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [https://machinelearningmastery.com/]

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Parkinson's Disease dataset is a binary classification situation where we are trying to predict one of two possible outcomes.

INTRODUCTION: The data used in this study were gathered from 188 patients with PD (107 men and 81 women) with ages ranging from 33 to 87 at the Department of Neurology in Cerrahpasa Faculty of Medicine, Istanbul University. The control group consists of 64 healthy individuals (23 men and 41 women) with ages varying between 41 and 82. During the data collection process, the microphone was set to 44.1 kHz, and following the physician's examination, the sustained phonation of the vowel /a/ was collected from each subject with three repetitions.

In the first iteration, the script focused on evaluating various machine learning algorithms and identifying the model that produces the best overall metrics. The first iteration established the performance baseline for accuracy and processing time.

In iteration Take2, we will examine the feature selection technique of attribute importance ranking using the Gradient Boosting algorithm. By selecting only the most important attributes, we hope to decrease the processing time while maintaining a similar level of prediction accuracy compared to the first iteration.

ANALYSIS: In the first iteration, the baseline performance of the machine learning algorithms achieved an average accuracy of 81.58%. Two algorithms (Extra Trees and Stochastic Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Extra Trees turned in the top overall result and achieved an accuracy metric of 88.09%. By using the optimized parameters, the Extra Trees algorithm processed the testing dataset with an accuracy of 87.22%, which was just slightly below the prediction accuracy using the training data.

In iteration Take2, the baseline performance of the machine learning algorithms achieved an average accuracy of 82.97%. Two algorithms (Extra Trees and Stochastic Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Extra Trees turned in the top overall result and achieved an accuracy metric of 90.17%. By using the optimized parameters, the Extra Trees algorithm processed the testing dataset with an accuracy of 89.42%, which was just slightly below the prediction accuracy using the training data.

From the model-building perspective, the number of attributes decreased by 585, from 753 down to 168. The processing time went from 18 minutes 16 seconds in the first iteration down to 11 minutes 33 seconds in Take2, which was a decrease of 36.7%.

CONCLUSION: For this iteration, the Attribute Importance Ranking technique combined with the Extra Trees algorithm achieved the best overall modeling results. The feature selection technique further reduced the processing time while achieving an even better prediction accuracy overall. For this dataset, Extra Trees should be considered for further modeling or production use.

Dataset Used: Parkinson's Disease Classification Data Set

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Parkinson%27s+Disease+Classification

The project aims to touch on the following areas:

  • Document a predictive modeling problem end-to-end
  • Explore data cleaning and transformation options
  • Explore non-ensemble and ensemble algorithms for baseline model performance
  • Explore algorithm tuning techniques for improving model performance

Any predictive modeling machine learning project generally can be broken down into about six major tasks:

  1. Prepare Problem
  2. Summarize Data
  3. Prepare Data
  4. Model and Evaluate Algorithms
  5. Improve Accuracy or Results
  6. Finalize Model and Present Results

Section 1 - Prepare Problem

1.a) Load libraries

In [1]:
import math
import numpy as np
import pandas as pd
import os
import sys
import shutil
import urllib.request
import zipfile
import warnings
import smtplib
from datetime import datetime
from email.message import EmailMessage
from matplotlib import pyplot
from pandas import read_csv
from pandas import get_dummies
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
# from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from joblib import dump
from joblib import load
from sklearn.feature_selection import RFE
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN
from imblearn.combine import SMOTETomek

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from imblearn.ensemble import BalancedRandomForestClassifier
from imblearn.ensemble import RUSBoostClassifier
from imblearn.ensemble import BalancedBaggingClassifier
from xgboost import XGBClassifier

# Create one random seed number for reproducible results
seedNum = 888
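This single seed is later passed to scikit-learn's random_state parameters throughout the script. As a quick illustration (a sketch with synthetic data, not part of this script), two splits made with the same seed are identical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

seedNum = 888

# Hypothetical data: 20 samples, 2 features, alternating labels
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Two splits made with the same random_state produce identical partitions
X_a, _, y_a, _ = train_test_split(X, y, test_size=0.3, random_state=seedNum)
X_b, _, y_b, _ = train_test_split(X, y, test_size=0.3, random_state=seedNum)
assert (X_a == X_b).all() and (y_a == y_b).all()
```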

1.b) Set up the email notification function

In [2]:
def email_notify(msg_text):
    sender = os.environ.get('MAIL_SENDER')
    receiver = os.environ.get('MAIL_RECEIVER')
    gateway = os.environ.get('SMTP_GATEWAY')
    smtpuser = os.environ.get('SMTP_USERNAME')
    password = os.environ.get('SMTP_PASSWORD')
    if None in (sender, receiver, gateway, smtpuser, password):
        sys.exit("Incomplete email setup info. Script Processing Aborted!!!")
    msg = EmailMessage()
    msg.set_content(msg_text)
    msg['Subject'] = 'Notification from Python Binary Classification Script'
    msg['From'] = sender
    msg['To'] = receiver
    server = smtplib.SMTP(gateway, 587)
    server.starttls()
    server.login(smtpuser, password)
    server.send_message(msg)
    server.quit()
In [3]:
email_notify("Library and Data Loading has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

1.c) Load dataset

In [4]:
startTimeScript = datetime.now()

dataset_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00470/pd_speech_features.rar'
dest_file = os.path.basename(dataset_path)
if not os.path.isfile(dest_file):
    print('Downloading ' + dataset_path + ' as ' + dest_file)
    with urllib.request.urlopen(dataset_path) as in_resp, open(dest_file, 'wb') as out_file:
        shutil.copyfileobj(in_resp, out_file)
    print(dest_file + ' downloaded!')
#     Note: the downloaded archive is a RAR file, so the zipfile module cannot unpack it.
#     Extract pd_speech_features.csv from the archive separately (e.g., with a RAR-capable
#     tool) before the read_csv step below.
#     print('Unpacking ' + dest_file)
#     with zipfile.ZipFile(dest_file, 'r') as zip_ref:
#         zip_ref.extractall('.')
#     print(dest_file + ' unpacked!')

inputFile = "pd_speech_features.csv"
xy_original_df = read_csv(inputFile, sep=',', header=1)
xy_original_df.rename(columns={'class': 'targetVar'}, inplace=True)
xy_original_df.drop(columns=['id'], inplace=True)
In [5]:
xy_original_df.head()
Out[5]:
gender PPE DFA RPDE numPulses numPeriodsPulses meanPeriodPulses stdDevPeriodPulses locPctJitter locAbsJitter ... tqwt_kurtosisValue_dec_28 tqwt_kurtosisValue_dec_29 tqwt_kurtosisValue_dec_30 tqwt_kurtosisValue_dec_31 tqwt_kurtosisValue_dec_32 tqwt_kurtosisValue_dec_33 tqwt_kurtosisValue_dec_34 tqwt_kurtosisValue_dec_35 tqwt_kurtosisValue_dec_36 targetVar
0 1 0.85247 0.71826 0.57227 240 239 0.008064 0.000087 0.00218 0.000018 ... 1.5620 2.6445 3.8686 4.2105 5.1221 4.4625 2.6202 3.0004 18.9405 1
1 1 0.76686 0.69481 0.53966 234 233 0.008258 0.000073 0.00195 0.000016 ... 1.5589 3.6107 23.5155 14.1962 11.0261 9.5082 6.5245 6.3431 45.1780 1
2 1 0.85083 0.67604 0.58982 232 231 0.008340 0.000060 0.00176 0.000015 ... 1.5643 2.3308 9.4959 10.7458 11.0177 4.8066 2.9199 3.1495 4.7666 1
3 0 0.41121 0.79672 0.59257 178 177 0.010858 0.000183 0.00419 0.000046 ... 3.7805 3.5664 5.2558 14.0403 4.2235 4.6857 4.8460 6.2650 4.0603 1
4 0 0.32790 0.79782 0.53028 236 235 0.008162 0.002669 0.00535 0.000044 ... 6.1727 5.8416 6.0805 5.7621 7.7817 11.6891 8.2103 5.0559 6.1164 1

5 rows × 754 columns

In [6]:
# Use variable totCol to hold the number of columns in the dataframe
totCol = len(xy_original_df.columns)

# Set up variable totAttr for the total number of attribute columns
totAttr = totCol-1
In [7]:
# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# If (targetCol != 1) and (targetCol != totCol), be aware when slicing up the dataframes for visualization
targetCol = totCol

# Standardize the class column to the name of targetVar if required
# xy_original_df = xy_original_df.rename(columns={'outcome': 'targetVar'})
In [8]:
# We create training datasets (xy_train_df, x_train_df, y_train_df) for various visualization and cleaning/transformation operations
# We create testing datasets (x_test_df, y_test_df) for various visualization and cleaning/transformation operations
# Use 70% of the data to train the models and the remaining for testing/validation

if targetCol == totCol:
    x_original_df = xy_original_df.iloc[:,0:totAttr]
    y_original_df = xy_original_df.iloc[:,totAttr]
else:
    x_original_df = xy_original_df.iloc[:,1:totCol]
    y_original_df = xy_original_df.iloc[:,0]

validation_size = 0.30
x_train_df, x_test_df, y_train_df, y_test_df = train_test_split(x_original_df, y_original_df, test_size=validation_size, random_state=seedNum)
xy_train_df = pd.concat([x_train_df, y_train_df], axis=1)
print("xy_original_df.shape: {} xy_train_df.shape: {}".format(xy_original_df.shape, xy_train_df.shape))
print("x_train_df.shape: {} y_train_df.shape: {}".format(x_train_df.shape, y_train_df.shape))
print("x_test_df.shape: {} y_test_df.shape: {}".format(x_test_df.shape, y_test_df.shape))
xy_original_df.shape: (756, 754) xy_train_df.shape: (529, 754)
x_train_df.shape: (529, 753) y_train_df.shape: (529,)
x_test_df.shape: (227, 753) y_test_df.shape: (227,)
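One caveat: the split above does not stratify by class, even though the dataset is imbalanced (as Section 2 will show). If desired, passing stratify to train_test_split preserves the class ratio in both partitions; a minimal sketch on hypothetical labels, not the notebook's dataframes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same 3:1 imbalance as this dataset
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 75 + [0] * 25)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=888, stratify=y)

# Both partitions keep approximately the 3:1 positive-to-negative ratio
assert abs(y_tr.mean() - 0.75) < 0.05
assert abs(y_te.mean() - 0.75) < 0.05
```

The notebook keeps the unstratified split, so this is only an option to consider for future iterations.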

1.d) Set up the key parameters to be used in the script

In [9]:
# Set up the number of CPU cores available for multi-thread processing
cpu_num = None

# Set up the number of row and columns for visualization display. dispRow * dispCol should be >= totAttr
dispCol = 4
if totAttr % dispCol == 0 :
    dispRow = totAttr // dispCol
else :
    dispRow = (totAttr // dispCol) + 1
    
# Set figure width to display the data visualization plots
fig_size = pyplot.rcParams["figure.figsize"]
fig_size[0] = dispCol*4
fig_size[1] = dispRow*4
pyplot.rcParams["figure.figsize"] = fig_size

# Set the warning message filter
warnings.filterwarnings("ignore", category=FutureWarning)
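The if/else used to compute dispRow above is just ceiling division; since math is already imported, it can be collapsed into one line (an equivalent form, shown with this dataset's attribute count):

```python
import math

totAttr = 753   # attribute count for this dataset
dispCol = 4

# Ceiling division: the smallest dispRow with dispRow * dispCol >= totAttr
dispRow = math.ceil(totAttr / dispCol)
assert dispRow == 189 and dispRow * dispCol >= totAttr
```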
In [10]:
email_notify("Library and Data Loading completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

Section 2 - Summarize Data

To gain a better understanding of the data that we have on hand, we will leverage a number of descriptive statistics and data visualization techniques. The plan is to use the results to consider new questions, review assumptions, and validate hypotheses that we can investigate later with specialized models.

In [11]:
email_notify("Data Summarization and Visualization has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

2.a) Descriptive statistics

2.a.i) Peek at the data itself.

In [12]:
xy_train_df.head()
Out[12]:
gender PPE DFA RPDE numPulses numPeriodsPulses meanPeriodPulses stdDevPeriodPulses locPctJitter locAbsJitter ... tqwt_kurtosisValue_dec_28 tqwt_kurtosisValue_dec_29 tqwt_kurtosisValue_dec_30 tqwt_kurtosisValue_dec_31 tqwt_kurtosisValue_dec_32 tqwt_kurtosisValue_dec_33 tqwt_kurtosisValue_dec_34 tqwt_kurtosisValue_dec_35 tqwt_kurtosisValue_dec_36 targetVar
87 0 0.83245 0.64735 0.29877 395 394 0.004886 0.000034 0.00085 0.000004 ... 187.3257 149.0010 118.6067 81.8511 46.7268 36.8371 30.9615 14.3100 13.1138 0
498 1 0.83735 0.72443 0.51084 313 312 0.006161 0.000050 0.00117 0.000007 ... 111.9530 64.6800 16.3082 4.4029 2.9942 3.7717 44.8800 45.0639 87.2839 1
688 1 0.11907 0.70259 0.57310 356 354 0.005412 0.000742 0.00173 0.000009 ... 11.0121 9.2407 2.8454 2.9304 2.7041 4.0528 10.1490 12.7809 71.2901 1
504 0 0.38927 0.73284 0.68941 196 195 0.009843 0.000260 0.00487 0.000048 ... 58.9485 59.7702 68.1092 70.4345 74.6148 64.7489 51.1687 44.1420 95.0850 1
691 0 0.80246 0.72914 0.28905 450 449 0.004294 0.000059 0.00065 0.000003 ... 63.6870 49.8495 22.3906 10.8352 8.7913 12.5790 29.9785 33.1159 86.6004 1

5 rows × 754 columns

2.a.ii) Dimensions of the dataset.

In [13]:
xy_train_df.shape
Out[13]:
(529, 754)

2.a.iii) Types of the attributes.

In [14]:
xy_train_df.info()
xy_train_df.dtypes
<class 'pandas.core.frame.DataFrame'>
Int64Index: 529 entries, 87 to 510
Columns: 754 entries, gender to targetVar
dtypes: float64(749), int64(5)
memory usage: 3.0 MB
Out[14]:
gender                          int64
PPE                           float64
DFA                           float64
RPDE                          float64
numPulses                       int64
numPeriodsPulses                int64
meanPeriodPulses              float64
stdDevPeriodPulses            float64
locPctJitter                  float64
locAbsJitter                  float64
rapJitter                     float64
ppq5Jitter                    float64
ddpJitter                     float64
locShimmer                    float64
locDbShimmer                  float64
apq3Shimmer                   float64
apq5Shimmer                   float64
apq11Shimmer                  float64
ddaShimmer                    float64
meanAutoCorrHarmonicity       float64
meanNoiseToHarmHarmonicity    float64
meanHarmToNoiseHarmonicity    float64
minIntensity                  float64
maxIntensity                  float64
meanIntensity                 float64
f1                            float64
f2                            float64
f3                            float64
f4                            float64
b1                            float64
                               ...   
tqwt_kurtosisValue_dec_8      float64
tqwt_kurtosisValue_dec_9      float64
tqwt_kurtosisValue_dec_10     float64
tqwt_kurtosisValue_dec_11     float64
tqwt_kurtosisValue_dec_12     float64
tqwt_kurtosisValue_dec_13     float64
tqwt_kurtosisValue_dec_14     float64
tqwt_kurtosisValue_dec_15     float64
tqwt_kurtosisValue_dec_16     float64
tqwt_kurtosisValue_dec_17     float64
tqwt_kurtosisValue_dec_18     float64
tqwt_kurtosisValue_dec_19     float64
tqwt_kurtosisValue_dec_20     float64
tqwt_kurtosisValue_dec_21     float64
tqwt_kurtosisValue_dec_22     float64
tqwt_kurtosisValue_dec_23     float64
tqwt_kurtosisValue_dec_24     float64
tqwt_kurtosisValue_dec_25     float64
tqwt_kurtosisValue_dec_26     float64
tqwt_kurtosisValue_dec_27     float64
tqwt_kurtosisValue_dec_28     float64
tqwt_kurtosisValue_dec_29     float64
tqwt_kurtosisValue_dec_30     float64
tqwt_kurtosisValue_dec_31     float64
tqwt_kurtosisValue_dec_32     float64
tqwt_kurtosisValue_dec_33     float64
tqwt_kurtosisValue_dec_34     float64
tqwt_kurtosisValue_dec_35     float64
tqwt_kurtosisValue_dec_36     float64
targetVar                       int64
Length: 754, dtype: object

2.a.iv) Statistical summary of all attributes.

In [15]:
xy_train_df.describe()
Out[15]:
gender PPE DFA RPDE numPulses numPeriodsPulses meanPeriodPulses stdDevPeriodPulses locPctJitter locAbsJitter ... tqwt_kurtosisValue_dec_28 tqwt_kurtosisValue_dec_29 tqwt_kurtosisValue_dec_30 tqwt_kurtosisValue_dec_31 tqwt_kurtosisValue_dec_32 tqwt_kurtosisValue_dec_33 tqwt_kurtosisValue_dec_34 tqwt_kurtosisValue_dec_35 tqwt_kurtosisValue_dec_36 targetVar
count 529.000000 529.000000 529.000000 529.000000 529.000000 529.000000 529.000000 529.000000 529.000000 5.290000e+02 ... 529.000000 529.000000 529.000000 529.000000 529.000000 529.000000 529.000000 529.000000 529.000000 529.000000
mean 0.508507 0.748112 0.703186 0.489701 324.170132 322.865784 0.006372 0.000405 0.002404 1.749092e-05 ... 25.214724 21.891424 17.930038 13.612910 12.001219 12.310185 14.989267 14.957852 32.154111 0.746692
std 0.500401 0.166663 0.069964 0.139258 100.497557 100.799319 0.001842 0.000756 0.002811 2.447855e-05 ... 41.184485 32.444435 25.517965 19.740064 17.443065 16.006522 15.573893 14.418290 34.245560 0.435318
min 0.000000 0.041551 0.543500 0.178730 80.000000 73.000000 0.002107 0.000011 0.000210 6.860000e-07 ... 1.520400 1.531700 1.582900 1.747200 1.789500 1.628700 1.861700 1.955900 2.364000 0.000000
25% 0.000000 0.766510 0.646250 0.387350 251.000000 250.000000 0.005047 0.000050 0.000970 5.180000e-06 ... 2.505300 3.350800 3.313700 3.063200 2.898300 3.126600 3.754800 3.774300 4.142300 0.000000
50% 1.000000 0.807660 0.704060 0.485730 320.000000 319.000000 0.006004 0.000078 0.001510 9.320000e-06 ... 5.578900 6.740100 5.625900 4.578700 4.297300 4.785600 7.091500 7.837900 11.771200 1.000000
75% 1.000000 0.832300 0.756830 0.583950 381.000000 380.000000 0.007529 0.000187 0.002560 1.930000e-05 ... 26.493200 25.086800 19.769000 12.785000 10.838200 12.562600 22.824200 22.909000 62.086800 1.000000
max 1.000000 0.907660 0.852640 0.849800 907.000000 905.000000 0.012101 0.003483 0.027750 2.564800e-04 ... 239.788800 203.311300 121.542900 100.706500 82.750200 73.241700 62.007300 57.544300 156.423700 1.000000

8 rows × 754 columns

2.a.v) Count missing values.

In [16]:
print('Number of NaN in the dataframe columns:\n', xy_train_df.isnull().sum())
print('Total number of NaN in the dataframe: ', xy_train_df.isnull().sum().sum())
Number of NaN in the dataframe columns:
 gender                        0
PPE                           0
DFA                           0
RPDE                          0
numPulses                     0
numPeriodsPulses              0
meanPeriodPulses              0
stdDevPeriodPulses            0
locPctJitter                  0
locAbsJitter                  0
rapJitter                     0
ppq5Jitter                    0
ddpJitter                     0
locShimmer                    0
locDbShimmer                  0
apq3Shimmer                   0
apq5Shimmer                   0
apq11Shimmer                  0
ddaShimmer                    0
meanAutoCorrHarmonicity       0
meanNoiseToHarmHarmonicity    0
meanHarmToNoiseHarmonicity    0
minIntensity                  0
maxIntensity                  0
meanIntensity                 0
f1                            0
f2                            0
f3                            0
f4                            0
b1                            0
                             ..
tqwt_kurtosisValue_dec_8      0
tqwt_kurtosisValue_dec_9      0
tqwt_kurtosisValue_dec_10     0
tqwt_kurtosisValue_dec_11     0
tqwt_kurtosisValue_dec_12     0
tqwt_kurtosisValue_dec_13     0
tqwt_kurtosisValue_dec_14     0
tqwt_kurtosisValue_dec_15     0
tqwt_kurtosisValue_dec_16     0
tqwt_kurtosisValue_dec_17     0
tqwt_kurtosisValue_dec_18     0
tqwt_kurtosisValue_dec_19     0
tqwt_kurtosisValue_dec_20     0
tqwt_kurtosisValue_dec_21     0
tqwt_kurtosisValue_dec_22     0
tqwt_kurtosisValue_dec_23     0
tqwt_kurtosisValue_dec_24     0
tqwt_kurtosisValue_dec_25     0
tqwt_kurtosisValue_dec_26     0
tqwt_kurtosisValue_dec_27     0
tqwt_kurtosisValue_dec_28     0
tqwt_kurtosisValue_dec_29     0
tqwt_kurtosisValue_dec_30     0
tqwt_kurtosisValue_dec_31     0
tqwt_kurtosisValue_dec_32     0
tqwt_kurtosisValue_dec_33     0
tqwt_kurtosisValue_dec_34     0
tqwt_kurtosisValue_dec_35     0
tqwt_kurtosisValue_dec_36     0
targetVar                     0
Length: 754, dtype: int64
Total number of NaN in the dataframe:  0

2.a.vi) Summarize the levels of the class attribute.

In [17]:
xy_train_df.groupby('targetVar').size()
Out[17]:
targetVar
0    134
1    395
dtype: int64

2.b) Data visualizations

2.b.i) Univariate plots to better understand each attribute
In [18]:
# Histograms for each attribute
x_train_df.hist(layout=(dispRow,dispCol))
pyplot.show()
In [19]:
# Density plot for each attribute
x_train_df.plot(kind='density', subplots=True, layout=(dispRow,dispCol))
pyplot.show()
In [20]:
# Box and Whisker plot for each attribute
x_train_df.plot(kind='box', subplots=True, layout=(dispRow,dispCol))
pyplot.show()

2.b.ii) Multivariate plots to better understand the relationships between attributes

In [21]:
# Scatterplot matrix
# scatter_matrix(x_train_df)
# pyplot.show()
In [22]:
# Correlation matrix
fig = pyplot.figure(figsize=(16,12))
ax = fig.add_subplot(111)
correlations = x_train_df.corr(method='pearson')
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
pyplot.show()
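With 753 attributes, the heatmap gives only a visual impression. To list the most correlated pairs programmatically, we can stack the upper triangle of the correlation matrix and sort it; a generic sketch on a small hypothetical frame, not the notebook's matrix:

```python
import numpy as np
import pandas as pd

# Small hypothetical frame: b tracks a exactly, c runs opposite to a
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [2.0, 4.0, 6.0, 8.0],
                   'c': [4.0, 3.0, 2.0, 1.0]})

corr = df.corr(method='pearson')
# Keep only the upper triangle (k=1 drops the diagonal), then rank the pairs
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().abs().sort_values(ascending=False)
print(pairs)
```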
In [23]:
email_notify("Data Summarization and Visualization completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

Section 3 - Prepare Data

Some datasets may require additional preparation activities that best expose the structure of the problem and the relationships between the input attributes and the output variable. Some data-prep tasks might include:

  • Cleaning data by removing duplicates, marking missing values and even imputing missing values.
  • Feature selection where redundant features may be removed.
  • Data transforms where attributes are scaled or redistributed in order to best expose the structure of the problem later to learning algorithms.
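For the scaling transform mentioned above, the StandardScaler and Pipeline imports from Section 1 can be combined so that scaling is fit only on the training data inside fit/predict; a generic sketch (not code this iteration executes):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical features on an arbitrary scale; label depends on feature 0
rng = np.random.RandomState(888)
X = rng.rand(60, 4) * 100.0
y = (X[:, 0] > 50).astype(int)

# The scaler is fit inside the pipeline, so the transform never leaks test data
model = Pipeline([('scaler', StandardScaler()),
                  ('clf', LogisticRegression(random_state=888))])
model.fit(X, y)
assert model.score(X, y) > 0.8
```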
In [24]:
email_notify("Data Cleaning and Transformation has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

3.a) Data Cleaning

In [25]:
# Not applicable for this iteration of the project
# Sample code for performing one-hot-encoding (Block #1 of 2)
# According to the data dictionary, columns SOME_ATTR should be converted to categorical type
# x_train_df['SOME_ATTR'] = x_train_df['SOME_ATTR'].astype('category')

# # Apply the One-Hot-Encoding (dummy variable handling) technique
# x_train_df = get_dummies(x_train_df)
# print(x_train_df.dtypes)
In [26]:
# Sample code for performing one-hot-encoding (Block #2 of 2)
# According to the data dictionary, columns SOME_ATTR should be converted to categorical type
# x_train_df['SOME_ATTR'] = x_train_df['SOME_ATTR'].astype('category')

# # Apply the One-Hot-Encoding (dummy variable handling) technique
# x_test_df = get_dummies(x_test_df)
# print(x_test_df.dtypes)

3.b) Feature Selection

In [27]:
# Not applicable for this iteration of the project
# Sample Code for finding collinear features (Block #1 of 3)
# def findCorrelated(corrmat, cutoff = 0.75):    

# ### search correlation matrix and identify pairs that if removed would reduce pair-wise correlations
# # args:
#     # corrmat: a correlation matrix
#     # cutoff: pairwise absolute correlation cutoff
# # returns:
#     # variables to removed

#     if(len(corrmat) != len(corrmat.columns)) : return 'Correlation matrix is not square'
#     averageCorr = corrmat.abs().mean(axis = 1)

#     # set lower triangle and diagonal of correlation matrix to NA
#     for row in range(0, len(corrmat)):
#         for col in range(0, (row+1)):
#             corrmat.iloc[row, col] = None

#     # where a pairwise correlation is greater than the cutoff value, check whether mean abs.corr of a or b is greater and cut it
#     to_delete = list()
#     for col in range(0, len(corrmat.columns)):
#         for row in range(0, len(corrmat)):
#             if(abs(corrmat.iloc[row, col]) > cutoff):
#                 if(averageCorr.iloc[row] > averageCorr.iloc[col]): to_delete.append(row)
#                 else: to_delete.append(col)

#     to_delete = list(set(to_delete))
#     return to_delete
In [28]:
# Sample Code for finding collinear features (Block #2 of 3)
# By using the calculated correlation matrix, find the collinear features
# collinear = findCorrelated(correlations, cutoff = 0.75)
# print('Number of attributes to be removed:', len(collinear))
# dropAttributes = x_train_df.columns[collinear]
# print('Columns to be removed:')
# print(dropAttributes)
In [29]:
# Sample Code for finding collinear features (Block #3 of 3)
# Drop the highly collinear features
# x_train_df.drop(labels=dropAttributes, axis=1, inplace=True)
# x_test_df.drop(labels=dropAttributes, axis=1, inplace=True)
In [30]:
# Attribute Importance Ranking (Block #1 of 3)
# Rank feature importance with the Gradient Boosting Classifier
x_impVal = x_train_df.values
y_impVal = y_train_df.values.ravel()
model = GradientBoostingClassifier(random_state=seedNum)
model.fit(x_impVal, y_impVal)
importanceScore = model.feature_importances_
attributeList = x_train_df.columns.tolist()
attributeImportance = pd.DataFrame({'attribute': attributeList, 'importance': importanceScore})
rankedAttributes = attributeImportance.sort_values('importance', ascending=False)
print(rankedAttributes)
                       attribute  importance
125   std_delta_delta_log_energy    0.106770
404      tqwt_entropy_log_dec_12    0.056851
475         tqwt_TKEO_std_dec_11    0.042397
347           tqwt_energy_dec_27    0.039331
341           tqwt_energy_dec_21    0.038348
62            mean_MFCC_6th_coef    0.034723
531      tqwt_medianValue_dec_31    0.030165
483         tqwt_TKEO_std_dec_19    0.029559
425      tqwt_entropy_log_dec_33    0.023272
58            mean_MFCC_2nd_coef    0.016805
132          std_6th_delta_delta    0.016634
120                std_8th_delta    0.015825
353           tqwt_energy_dec_33    0.014847
449        tqwt_TKEO_mean_dec_21    0.014636
134          std_8th_delta_delta    0.013678
122               std_10th_delta    0.013332
465          tqwt_TKEO_std_dec_1    0.012765
375  tqwt_entropy_shannon_dec_19    0.012309
592         tqwt_stdValue_dec_20    0.012016
2                            DFA    0.011840
453        tqwt_TKEO_mean_dec_25    0.011669
118                std_6th_delta    0.011157
338           tqwt_energy_dec_18    0.010977
419      tqwt_entropy_log_dec_27    0.010850
424      tqwt_entropy_log_dec_32    0.010717
327            tqwt_energy_dec_7    0.010532
484         tqwt_TKEO_std_dec_20    0.010332
405      tqwt_entropy_log_dec_13    0.010112
346           tqwt_energy_dec_26    0.009761
732    tqwt_kurtosisValue_dec_16    0.008792
..                           ...         ...
395       tqwt_entropy_log_dec_3    0.000000
409      tqwt_entropy_log_dec_17    0.000000
131          std_5th_delta_delta    0.000000
407      tqwt_entropy_log_dec_15    0.000000
406      tqwt_entropy_log_dec_14    0.000000
135          std_9th_delta_delta    0.000000
136         std_10th_delta_delta    0.000000
138         std_12th_delta_delta    0.000000
402      tqwt_entropy_log_dec_10    0.000000
401       tqwt_entropy_log_dec_9    0.000000
399       tqwt_entropy_log_dec_7    0.000000
398       tqwt_entropy_log_dec_6    0.000000
397       tqwt_entropy_log_dec_5    0.000000
396       tqwt_entropy_log_dec_4    0.000000
394       tqwt_entropy_log_dec_2    0.000000
146                    Ed_7_coef    0.000000
139                           Ea    0.000000
391  tqwt_entropy_shannon_dec_35    0.000000
390  tqwt_entropy_shannon_dec_34    0.000000
140                    Ed_1_coef    0.000000
388  tqwt_entropy_shannon_dec_32    0.000000
142                    Ed_3_coef    0.000000
385  tqwt_entropy_shannon_dec_29    0.000000
143                    Ed_4_coef    0.000000
382  tqwt_entropy_shannon_dec_26    0.000000
145                    Ed_6_coef    0.000000
379  tqwt_entropy_shannon_dec_23    0.000000
378  tqwt_entropy_shannon_dec_22    0.000000
377  tqwt_entropy_shannon_dec_21    0.000000
0                         gender    0.000000

[753 rows x 2 columns]
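The next block locates the importance cutoff with an explicit while-loop. The same cutoff can be computed in vectorized form with numpy's cumsum; a sketch with made-up scores, not the notebook's rankedAttributes:

```python
import numpy as np

# Hypothetical importance scores, already sorted in descending order
scores = np.array([0.5, 0.3, 0.1, 0.05, 0.03, 0.02])
maxThreshold = 0.99

# Cumulative share of total importance, attribute by attribute
cum = np.cumsum(scores / scores.sum())
# Index of the first attribute at which the threshold is reached
cutoff = int(np.argmax(cum >= maxThreshold))
assert cutoff == 5
```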
In [31]:
# Attribute Importance Ranking (Block #2 of 3)
# Set the importance threshold and find the attributes that do not contribute to it
importanceSum = sum(importanceScore)
maxThreshold = 0.99
i = 0
accumWeight = 0
exit_now = False
while ((i < totAttr) and (not exit_now)) :
    accumWeight = accumWeight + (rankedAttributes.iloc[i]['importance']/importanceSum)
    if (accumWeight >= maxThreshold) :
        exit_now = True
    else :
        i = i + 1
print('Number of attributes contributed to the importance threshold: ' + str(i))
lowAttributes = rankedAttributes.iloc[(i):(totAttr),]['attribute'].tolist()
print('Number of attributes found to be of low importance:',len(lowAttributes))
print(lowAttributes)
Number of attributes contributed to the importance threshold: 168
Number of attributes found to be of low importance: 585
['tqwt_minValue_dec_11', 'tqwt_medianValue_dec_16', 'tqwt_medianValue_dec_34', 'tqwt_skewnessValue_dec_17', 'locPctJitter', 'tqwt_maxValue_dec_28', 'tqwt_skewnessValue_dec_23', 'app_entropy_log_8_coef', 'tqwt_entropy_shannon_dec_20', 'mean_MFCC_11th_coef', 'std_MFCC_8th_coef', 'tqwt_energy_dec_35', 'tqwt_kurtosisValue_dec_36', 'VFER_NSR_SEO', 'tqwt_skewnessValue_dec_5', 'tqwt_kurtosisValue_dec_28', 'tqwt_skewnessValue_dec_13', 'mean_MFCC_3rd_coef', 'tqwt_maxValue_dec_27', 'std_2nd_delta', 'tqwt_maxValue_dec_10', 'mean_4th_delta_delta', 'tqwt_skewnessValue_dec_8', 'tqwt_skewnessValue_dec_31', 'VFER_SNR_SEO', 'GQ_std_cycle_open', 'app_LT_TKEO_std_6_coef', 'tqwt_maxValue_dec_3', 'app_LT_TKEO_std_1_coef', 'tqwt_skewnessValue_dec_26', 'tqwt_maxValue_dec_33', 'b2', 'tqwt_kurtosisValue_dec_5', 'tqwt_stdValue_dec_29', 'det_LT_TKEO_mean_5_coef', 'tqwt_skewnessValue_dec_33', 'IMF_SNR_TKEO', 'tqwt_maxValue_dec_18', 'mean_6th_delta', 'tqwt_TKEO_mean_dec_35', 'tqwt_skewnessValue_dec_14', 'det_entropy_shannon_6_coef', 'tqwt_maxValue_dec_16', 'app_TKEO_std_9_coef', 'IMF_SNR_entropy', 'std_MFCC_10th_coef', 'tqwt_TKEO_std_dec_13', 'det_LT_TKEO_std_10_coef', 'tqwt_maxValue_dec_35', 'app_LT_TKEO_std_9_coef', 'det_LT_TKEO_mean_4_coef', 'std_MFCC_2nd_coef', 'tqwt_medianValue_dec_20', 'tqwt_kurtosisValue_dec_27', 'tqwt_entropy_shannon_dec_11', 'tqwt_stdValue_dec_13', 'det_LT_entropy_shannon_9_coef', 'tqwt_kurtosisValue_dec_10', 'mean_delta_delta_log_energy', 'tqwt_TKEO_std_dec_4', 'tqwt_skewnessValue_dec_19', 'mean_MFCC_10th_coef', 'tqwt_TKEO_std_dec_7', 'meanNoiseToHarmHarmonicity', 'det_LT_entropy_shannon_10_coef', 'std_1st_delta', 'tqwt_TKEO_std_dec_28', 'app_TKEO_std_2_coef', 'tqwt_entropy_shannon_dec_12', 'app_entropy_log_7_coef', 'tqwt_stdValue_dec_27', 'tqwt_kurtosisValue_dec_15', 'tqwt_TKEO_std_dec_33', 'tqwt_entropy_log_dec_1', 'tqwt_energy_dec_4', 'tqwt_stdValue_dec_31', 'tqwt_stdValue_dec_24', 'tqwt_kurtosisValue_dec_29', 'det_entropy_log_1_coef', 'tqwt_medianValue_dec_26', 
'tqwt_entropy_shannon_dec_24', 'tqwt_TKEO_std_dec_5', 'app_entropy_log_3_coef', 'tqwt_energy_dec_23', 'det_TKEO_std_9_coef', 'tqwt_maxValue_dec_2', 'ppq5Jitter', 'tqwt_entropy_log_dec_30', 'mean_9th_delta_delta', 'tqwt_entropy_shannon_dec_18', 'det_LT_entropy_shannon_1_coef', 'ddaShimmer', 'tqwt_TKEO_mean_dec_4', 'tqwt_minValue_dec_6', 'tqwt_energy_dec_10', 'det_TKEO_std_4_coef', 'tqwt_TKEO_mean_dec_33', 'tqwt_energy_dec_2', 'app_LT_TKEO_mean_2_coef', 'tqwt_minValue_dec_4', 'tqwt_medianValue_dec_8', 'det_LT_entropy_log_1_coef', 'tqwt_energy_dec_12', 'tqwt_skewnessValue_dec_1', 'Ed_2_coef', 'tqwt_medianValue_dec_4', 'tqwt_stdValue_dec_35', 'tqwt_minValue_dec_14', 'tqwt_entropy_log_dec_8', 'tqwt_stdValue_dec_5', 'tqwt_minValue_dec_23', 'det_LT_TKEO_std_7_coef', 'tqwt_medianValue_dec_6', 'tqwt_stdValue_dec_33', 'tqwt_TKEO_std_dec_2', 'det_LT_entropy_log_2_coef', 'Ed_5_coef', 'tqwt_stdValue_dec_22', 'tqwt_entropy_shannon_dec_14', 'tqwt_energy_dec_9', 'tqwt_TKEO_mean_dec_5', 'det_entropy_log_10_coef', 'tqwt_meanValue_dec_6', 'tqwt_kurtosisValue_dec_9', 'tqwt_meanValue_dec_17', 'tqwt_meanValue_dec_16', 'tqwt_meanValue_dec_15', 'tqwt_meanValue_dec_14', 'apq3Shimmer', 'tqwt_meanValue_dec_13', 'tqwt_meanValue_dec_12', 'tqwt_meanValue_dec_11', 'tqwt_kurtosisValue_dec_11', 'tqwt_meanValue_dec_8', 'tqwt_kurtosisValue_dec_13', 'tqwt_meanValue_dec_7', 'tqwt_kurtosisValue_dec_12', 'tqwt_meanValue_dec_9', 'tqwt_meanValue_dec_18', 'tqwt_meanValue_dec_5', 'tqwt_meanValue_dec_10', 'VFER_SNR_TKEO', 'tqwt_meanValue_dec_19', 'tqwt_meanValue_dec_31', 'tqwt_kurtosisValue_dec_7', 'tqwt_stdValue_dec_4', 'tqwt_stdValue_dec_3', 'tqwt_stdValue_dec_2', 'tqwt_stdValue_dec_1', 'tqwt_meanValue_dec_36', 'tqwt_meanValue_dec_35', 'tqwt_meanValue_dec_34', 'tqwt_meanValue_dec_33', 'tqwt_meanValue_dec_32', 'tqwt_meanValue_dec_30', 'tqwt_meanValue_dec_20', 'tqwt_meanValue_dec_29', 'tqwt_meanValue_dec_28', 'tqwt_meanValue_dec_27', 'tqwt_meanValue_dec_26', 'tqwt_meanValue_dec_25', 'tqwt_meanValue_dec_24', 
'tqwt_meanValue_dec_23', 'tqwt_meanValue_dec_3', 'tqwt_meanValue_dec_22', 'tqwt_meanValue_dec_21', 'tqwt_meanValue_dec_4', 'mean_MFCC_8th_coef', 'tqwt_meanValue_dec_2', 'tqwt_medianValue_dec_14', 'tqwt_kurtosisValue_dec_24', 'tqwt_medianValue_dec_13', 'tqwt_kurtosisValue_dec_25', 'mean_9th_delta', 'tqwt_kurtosisValue_dec_26', 'tqwt_medianValue_dec_11', 'tqwt_medianValue_dec_10', 'mean_10th_delta', 'tqwt_kurtosisValue_dec_30', 'numPeriodsPulses', 'tqwt_medianValue_dec_7', 'tqwt_medianValue_dec_5', 'numPulses', 'tqwt_kurtosisValue_dec_33', 'tqwt_medianValue_dec_3', 'tqwt_medianValue_dec_2', 'mean_11th_delta', 'tqwt_TKEO_std_dec_36', 'tqwt_TKEO_std_dec_35', 'tqwt_TKEO_std_dec_34', 'tqwt_kurtosisValue_dec_34', 'tqwt_TKEO_std_dec_32', 'tqwt_kurtosisValue_dec_35', 'meanPeriodPulses', 'tqwt_medianValue_dec_15', 'tqwt_meanValue_dec_1', 'mean_8th_delta', 'tqwt_medianValue_dec_36', 'tqwt_stdValue_dec_7', 'mean_MFCC_9th_coef', 'tqwt_medianValue_dec_33', 'tqwt_medianValue_dec_32', 'tqwt_kurtosisValue_dec_14', 'locDbShimmer', 'tqwt_medianValue_dec_30', 'tqwt_medianValue_dec_29', 'mean_MFCC_12th_coef', 'mean_delta_log_energy', 'locShimmer', 'mean_0th_delta', 'tqwt_medianValue_dec_24', 'tqwt_medianValue_dec_23', 'mean_4th_delta', 'tqwt_kurtosisValue_dec_19', 'tqwt_kurtosisValue_dec_20', 'tqwt_medianValue_dec_19', 'mean_7th_delta', 'tqwt_kurtosisValue_dec_21', 'tqwt_medianValue_dec_17', 'stdDevPeriodPulses', 'tqwt_stdValue_dec_6', 'tqwt_stdValue_dec_11', 'tqwt_stdValue_dec_8', 'tqwt_stdValue_dec_9', 'VFER_entropy', 'tqwt_maxValue_dec_1', 'tqwt_minValue_dec_36', 'tqwt_minValue_dec_35', 'tqwt_minValue_dec_34', 'IMF_SNR_SEO', 'tqwt_skewnessValue_dec_6', 'tqwt_minValue_dec_32', 'tqwt_minValue_dec_31', 'tqwt_minValue_dec_30', 'tqwt_minValue_dec_29', 'VFER_std', 'tqwt_minValue_dec_28', 'tqwt_minValue_dec_27', 'VFER_mean', 'tqwt_minValue_dec_26', 'tqwt_minValue_dec_25', 'GNE_SNR_SEO', 'tqwt_minValue_dec_24', 'GQ_std_cycle_closed', 'tqwt_minValue_dec_22', 'tqwt_minValue_dec_21', 
'tqwt_minValue_dec_20', 'tqwt_skewnessValue_dec_3', 'tqwt_maxValue_dec_4', 'tqwt_skewnessValue_dec_2', 'tqwt_maxValue_dec_17', 'tqwt_maxValue_dec_26', 'tqwt_maxValue_dec_29', 'tqwt_maxValue_dec_30', 'tqwt_maxValue_dec_24', 'tqwt_maxValue_dec_23', 'tqwt_maxValue_dec_22', 'tqwt_maxValue_dec_21', 'tqwt_maxValue_dec_20', 'tqwt_maxValue_dec_19', 'tqwt_maxValue_dec_31', 'tqwt_maxValue_dec_32', 'tqwt_maxValue_dec_5', 'tqwt_maxValue_dec_15', 'tqwt_maxValue_dec_14', 'tqwt_maxValue_dec_13', 'VFER_NSR_TKEO', 'tqwt_maxValue_dec_11', 'tqwt_maxValue_dec_34', 'tqwt_maxValue_dec_9', 'tqwt_maxValue_dec_8', 'tqwt_maxValue_dec_7', 'tqwt_maxValue_dec_36', 'tqwt_minValue_dec_19', 'tqwt_skewnessValue_dec_15', 'tqwt_minValue_dec_18', 'tqwt_kurtosisValue_dec_1', 'tqwt_stdValue_dec_28', 'meanIntensity', 'mean_MFCC_0th_coef', 'tqwt_stdValue_dec_25', 'meanHarmToNoiseHarmonicity', 'tqwt_skewnessValue_dec_32', 'tqwt_skewnessValue_dec_34', 'tqwt_stdValue_dec_23', 'tqwt_stdValue_dec_21', 'tqwt_skewnessValue_dec_36', 'tqwt_stdValue_dec_19', 'tqwt_stdValue_dec_30', 'tqwt_stdValue_dec_18', 'tqwt_stdValue_dec_17', 'tqwt_stdValue_dec_16', 'tqwt_stdValue_dec_15', 'tqwt_kurtosisValue_dec_2', 'mean_MFCC_4th_coef', 'tqwt_kurtosisValue_dec_4', 'tqwt_stdValue_dec_12', 'mean_MFCC_7th_coef', 'apq5Shimmer', 'f1', 'tqwt_skewnessValue_dec_27', 'tqwt_skewnessValue_dec_16', 'tqwt_minValue_dec_8', 'tqwt_minValue_dec_16', 'tqwt_minValue_dec_15', 'tqwt_TKEO_std_dec_31', 'b3', 'tqwt_minValue_dec_13', 'tqwt_minValue_dec_12', 'tqwt_skewnessValue_dec_18', 'IMF_NSR_SEO', 'tqwt_skewnessValue_dec_21', 'tqwt_minValue_dec_9', 'IMF_NSR_TKEO', 'tqwt_stdValue_dec_32', 'tqwt_skewnessValue_dec_22', 'b1', 'IMF_NSR_entropy', 'tqwt_skewnessValue_dec_24', 'tqwt_minValue_dec_3', 'tqwt_minValue_dec_2', 'tqwt_minValue_dec_1', 'tqwt_stdValue_dec_36', 'f2', 'tqwt_stdValue_dec_34', 'det_TKEO_std_2_coef', 'tqwt_TKEO_std_dec_3', 'tqwt_TKEO_std_dec_30', 'det_LT_entropy_log_7_coef', 'app_LT_entropy_log_2_coef', 
'app_LT_entropy_shannon_10_coef', 'app_LT_entropy_shannon_9_coef', 'app_LT_entropy_shannon_8_coef', 'app_LT_entropy_shannon_7_coef', 'app_LT_entropy_shannon_6_coef', 'app_LT_entropy_shannon_5_coef', 'app_LT_entropy_shannon_4_coef', 'app_LT_entropy_shannon_3_coef', 'app_LT_entropy_shannon_2_coef', 'app_LT_entropy_shannon_1_coef', 'det_LT_TKEO_std_9_coef', 'det_LT_TKEO_std_8_coef', 'det_LT_TKEO_std_6_coef', 'det_LT_TKEO_std_5_coef', 'det_LT_TKEO_std_4_coef', 'det_LT_TKEO_std_3_coef', 'det_LT_TKEO_std_2_coef', 'det_LT_TKEO_std_1_coef', 'det_LT_TKEO_mean_10_coef', 'det_LT_TKEO_mean_9_coef', 'det_LT_TKEO_mean_8_coef', 'det_LT_TKEO_mean_7_coef', 'det_LT_TKEO_mean_6_coef', 'det_LT_TKEO_mean_3_coef', 'det_LT_TKEO_mean_2_coef', 'det_LT_TKEO_mean_1_coef', 'det_LT_entropy_log_10_coef', 'det_LT_entropy_log_9_coef', 'app_LT_entropy_log_3_coef', 'app_LT_entropy_log_4_coef', 'app_LT_entropy_log_5_coef', 'det_entropy_log_8_coef', 'tqwt_energy_dec_13', 'tqwt_energy_dec_11', 'tqwt_energy_dec_8', 'det_entropy_log_4_coef', 'det_entropy_log_5_coef', 'tqwt_energy_dec_5', 'det_entropy_log_6_coef', 'det_entropy_log_7_coef', 'app_LT_TKEO_std_10_coef', 'app_LT_TKEO_std_8_coef', 'app_LT_TKEO_std_7_coef', 'app_LT_TKEO_std_5_coef', 'app_LT_TKEO_std_4_coef', 'app_LT_TKEO_std_2_coef', 'app_LT_entropy_log_6_coef', 'app_LT_TKEO_mean_10_coef', 'app_LT_TKEO_mean_9_coef', 'app_LT_TKEO_mean_8_coef', 'app_LT_TKEO_mean_7_coef', 'app_LT_TKEO_mean_6_coef', 'app_LT_TKEO_mean_5_coef', 'app_LT_TKEO_mean_4_coef', 'app_LT_TKEO_mean_3_coef', 'app_LT_TKEO_mean_1_coef', 'app_LT_entropy_log_10_coef', 'app_LT_entropy_log_9_coef', 'app_LT_entropy_log_8_coef', 'app_LT_entropy_log_7_coef', 'det_LT_entropy_log_8_coef', 'det_LT_entropy_log_6_coef', 'tqwt_TKEO_std_dec_27', 'det_LT_entropy_log_5_coef', 'app_det_TKEO_mean_6_coef', 'app_det_TKEO_mean_5_coef', 'app_det_TKEO_mean_4_coef', 'app_det_TKEO_mean_3_coef', 'app_det_TKEO_mean_2_coef', 'app_det_TKEO_mean_1_coef', 'app_entropy_log_10_coef', 'app_entropy_log_9_coef', 
'det_TKEO_mean_6_coef', 'app_entropy_log_6_coef', 'app_entropy_log_5_coef', 'app_entropy_log_4_coef', 'app_entropy_log_2_coef', 'det_TKEO_mean_7_coef', 'app_entropy_shannon_10_coef', 'app_entropy_shannon_9_coef', 'app_entropy_shannon_8_coef', 'app_entropy_shannon_7_coef', 'app_entropy_shannon_6_coef', 'app_entropy_shannon_5_coef', 'det_TKEO_mean_8_coef', 'det_TKEO_mean_9_coef', 'det_TKEO_mean_10_coef', 'det_TKEO_std_1_coef', 'det_TKEO_std_10_coef', 'det_TKEO_std_8_coef', 'det_TKEO_std_7_coef', 'det_TKEO_std_6_coef', 'det_TKEO_std_5_coef', 'app_det_TKEO_mean_7_coef', 'app_det_TKEO_mean_8_coef', 'app_det_TKEO_mean_9_coef', 'Ed2_6_coef', 'det_LT_entropy_log_4_coef', 'det_LT_entropy_log_3_coef', 'det_TKEO_mean_1_coef', 'det_TKEO_mean_2_coef', 'det_LT_entropy_shannon_6_coef', 'det_LT_entropy_shannon_5_coef', 'det_LT_entropy_shannon_4_coef', 'det_LT_entropy_shannon_3_coef', 'det_LT_entropy_shannon_2_coef', 'Ed2_10_coef', 'Ed2_9_coef', 'Ed2_8_coef', 'Ed2_7_coef', 'Ed2_5_coef', 'app_det_TKEO_mean_10_coef', 'Ed2_4_coef', 'Ed2_3_coef', 'Ed2_2_coef', 'Ed2_1_coef', 'Ea2', 'app_TKEO_std_10_coef', 'app_TKEO_std_8_coef', 'app_TKEO_std_7_coef', 'app_TKEO_std_6_coef', 'app_TKEO_std_5_coef', 'det_TKEO_mean_3_coef', 'det_TKEO_mean_5_coef', 'app_TKEO_std_1_coef', 'det_entropy_log_3_coef', 'tqwt_energy_dec_15', 'tqwt_energy_dec_16', 'tqwt_energy_dec_17', 'std_MFCC_0th_coef', 'tqwt_TKEO_mean_dec_20', 'std_MFCC_6th_coef', 'tqwt_TKEO_mean_dec_17', 'tqwt_TKEO_mean_dec_16', 'tqwt_TKEO_mean_dec_15', 'tqwt_TKEO_mean_dec_14', 'tqwt_TKEO_mean_dec_13', 'tqwt_TKEO_mean_dec_12', 'std_MFCC_7th_coef', 'tqwt_TKEO_mean_dec_10', 'tqwt_TKEO_mean_dec_9', 'tqwt_TKEO_mean_dec_8', 'tqwt_TKEO_mean_dec_7', 'tqwt_TKEO_mean_dec_6', 'tqwt_TKEO_mean_dec_3', 'tqwt_TKEO_mean_dec_2', 'std_MFCC_11th_coef', 'std_0th_delta', 'std_3rd_delta', 'std_5th_delta', 'std_11th_delta', 'tqwt_entropy_log_dec_28', 'std_delta_delta_0th', 'std_1st_delta_delta', 'tqwt_entropy_log_dec_25', 'tqwt_entropy_log_dec_24', 
'std_2nd_delta_delta', 'tqwt_entropy_log_dec_22', 'tqwt_TKEO_mean_dec_22', 'tqwt_TKEO_mean_dec_23', 'tqwt_TKEO_mean_dec_24', 'mean_8th_delta_delta', 'tqwt_TKEO_std_dec_26', 'mean_delta_delta_0th', 'tqwt_TKEO_std_dec_24', 'tqwt_TKEO_std_dec_23', 'tqwt_TKEO_std_dec_22', 'tqwt_TKEO_std_dec_21', 'mean_2nd_delta_delta', 'mean_3rd_delta_delta', 'tqwt_TKEO_std_dec_17', 'tqwt_TKEO_std_dec_16', 'tqwt_TKEO_std_dec_15', 'tqwt_TKEO_std_dec_14', 'mean_5th_delta_delta', 'tqwt_TKEO_std_dec_10', 'mean_12th_delta_delta', 'tqwt_TKEO_std_dec_8', 'tqwt_TKEO_std_dec_6', 'det_TKEO_std_3_coef', 'mean_10th_delta_delta', 'tqwt_TKEO_mean_dec_36', 'tqwt_TKEO_mean_dec_34', 'tqwt_TKEO_mean_dec_32', 'tqwt_TKEO_mean_dec_31', 'tqwt_TKEO_mean_dec_30', 'tqwt_TKEO_mean_dec_29', 'tqwt_TKEO_mean_dec_28', 'tqwt_TKEO_mean_dec_27', 'tqwt_TKEO_mean_dec_26', 'tqwt_entropy_log_dec_21', 'std_3rd_delta_delta', 'tqwt_entropy_log_dec_19', 'det_entropy_shannon_2_coef', 'Ed_10_coef', 'tqwt_entropy_shannon_dec_13', 'tqwt_entropy_shannon_dec_10', 'tqwt_entropy_shannon_dec_9', 'tqwt_entropy_shannon_dec_8', 'tqwt_entropy_shannon_dec_7', 'tqwt_entropy_shannon_dec_6', 'tqwt_entropy_shannon_dec_5', 'det_entropy_shannon_1_coef', 'tqwt_entropy_shannon_dec_3', 'tqwt_entropy_shannon_dec_2', 'tqwt_entropy_shannon_dec_1', 'tqwt_energy_dec_36', 'tqwt_energy_dec_34', 'tqwt_entropy_shannon_dec_17', 'det_entropy_shannon_3_coef', 'det_entropy_shannon_4_coef', 'tqwt_energy_dec_31', 'tqwt_energy_dec_30', 'tqwt_energy_dec_29', 'det_entropy_shannon_5_coef', 'det_entropy_shannon_7_coef', 'det_entropy_shannon_8_coef', 'tqwt_energy_dec_24', 'tqwt_energy_dec_22', 'det_entropy_shannon_9_coef', 'det_entropy_shannon_10_coef', 'tqwt_energy_dec_19', 'Ed_9_coef', 'Ed_8_coef', 'tqwt_entropy_log_dec_18', 'tqwt_entropy_log_dec_3', 'tqwt_entropy_log_dec_17', 'std_5th_delta_delta', 'tqwt_entropy_log_dec_15', 'tqwt_entropy_log_dec_14', 'std_9th_delta_delta', 'std_10th_delta_delta', 'std_12th_delta_delta', 'tqwt_entropy_log_dec_10', 
'tqwt_entropy_log_dec_9', 'tqwt_entropy_log_dec_7', 'tqwt_entropy_log_dec_6', 'tqwt_entropy_log_dec_5', 'tqwt_entropy_log_dec_4', 'tqwt_entropy_log_dec_2', 'Ed_7_coef', 'Ea', 'tqwt_entropy_shannon_dec_35', 'tqwt_entropy_shannon_dec_34', 'Ed_1_coef', 'tqwt_entropy_shannon_dec_32', 'Ed_3_coef', 'tqwt_entropy_shannon_dec_29', 'Ed_4_coef', 'tqwt_entropy_shannon_dec_26', 'Ed_6_coef', 'tqwt_entropy_shannon_dec_23', 'tqwt_entropy_shannon_dec_22', 'tqwt_entropy_shannon_dec_21', 'gender']
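For context, an attribute list like the one above can be produced by fitting the Gradient Boosting algorithm and collecting the attributes whose importance scores fall below a cutoff. The sketch below illustrates the idea on toy data; the feature names, the 0.02 threshold, and the dataset sizes are hypothetical stand-ins, not the notebook's actual Block #1/#2 values.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy stand-in for the training dataframe used in the notebook
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=1)
x_df = pd.DataFrame(X, columns=['feat_%d' % i for i in range(10)])

# Rank attributes by their Gradient Boosting importance scores
ranker = GradientBoostingClassifier(random_state=1)
ranker.fit(x_df, y)

# Collect the low-importance attributes for removal (hypothetical cutoff)
threshold = 0.02
lowAttributes = [col for col, imp in zip(x_df.columns, ranker.feature_importances_)
                 if imp < threshold]
print(len(lowAttributes), 'attributes fall below the cutoff')
```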
In [32]:
# Sample code for performing Attribute Importance Ranking (Block #3 of 3)
# Removing the unselected attributes from the training and validation dataframes
x_train_df.drop(labels=lowAttributes, axis=1, inplace=True)
x_test_df.drop(labels=lowAttributes, axis=1, inplace=True)
/home/dlowe/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py:3940: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)
In [33]:
# Not applicable for this iteration of the project
# Sample code for performing Recursive Feature Elimination (Block #1 of 3)
# Using the Logistic Regression algorithm, we perform the Recursive Feature Elimination (RFE) technique
# x_rfeVal = x_train_df.values
# y_rfeVal = y_train_df.values.ravel()
# estimator = LogisticRegression(random_state=seedNum)
# selector = RFE(estimator, n_features_to_select=40)
# selector = selector.fit(x_rfeVal, y_rfeVal)
# print('The number of selected features:',selector.n_features_)
# print('The mask of selected features:\n',selector.support_)
# print('The ranking of the features:\n',selector.ranking_)
In [34]:
# Sample code for performing Recursive Feature Elimination (Block #2 of 3)
# Remove the unselected attributes from the training dataframes
# x_train_df = x_train_df[x_train_df.columns[selector.support_]]
# print(x_train_df.shape)
# x_train_df.head()
In [35]:
# Sample code for performing Recursive Feature Elimination (Block #3 of 3)
# Remove the unselected attributes from the validation dataframes
# x_test_df = x_test_df[x_test_df.columns[selector.support_]]
# print(x_test_df.shape)
# x_test_df.head()
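The commented sample above can be exercised end-to-end on toy data. In this sketch the estimator settings and the choice of four surviving features are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Toy data standing in for x_train_df / y_train_df
X, y = make_classification(n_samples=150, n_features=12, n_informative=4,
                           random_state=7)

# Recursively eliminate features until only the requested number remain
estimator = LogisticRegression(max_iter=1000, random_state=7)
selector = RFE(estimator, n_features_to_select=4)
selector = selector.fit(X, y)

print('The number of selected features:', selector.n_features_)

# Keep only the surviving columns, mirroring Blocks #2 and #3 above
X_selected = X[:, selector.support_]
print(X_selected.shape)  # (150, 4)
```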

3.c) Data Transforms

In [36]:
# Not applicable for this iteration of the project
# Sample code for performing SMOTE transformation on the training data (Block #1 of 2)
# x_train_imbal = x_train_df.values
# y_train_imbal = y_train_df.values.ravel()
# print("x_train_imbal.shape: {} y_train_imbal.shape: {}".format(x_train_imbal.shape, y_train_imbal.shape))
In [37]:
# Sample code for performing SMOTE transformation on the training data (Block #2 of 2)
# print ('Original class distribution - 1: {} | 0: {}'.format((y_train_imbal==1).sum(), (y_train_imbal==0).sum()))
# sm = SMOTE(random_state=seedNum, sampling_strategy=1.0)
# x_train, y_train = sm.fit_resample(x_train_imbal, y_train_imbal)
# print ('Resampled class distribution - 1: {} | 0: {}'.format((y_train==1).sum(), (y_train==0).sum()))

3.d) Display the Final Datasets for Model-Building

In [38]:
# We finalize the training and testing datasets for the modeling activities
x_train = x_train_df.values
y_train = y_train_df.values
x_test = x_test_df.values
y_test = y_test_df.values
print("x_train.shape: {} y_train.shape: {}".format(x_train.shape, y_train.shape))
print("x_test.shape: {} y_test.shape: {}".format(x_test.shape, y_test.shape))
email_notify("Data Cleaning and Transformation completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
x_train.shape: (529, 168) y_train.shape: (529,)
x_test.shape: (227, 168) y_test.shape: (227,)

4. Model and Evaluate Algorithms

After the data preparation, we work on finding a workable model by evaluating a subset of machine learning algorithms that are good at exploiting the structure of the training data. The typical evaluation tasks include:

  • Defining test options such as cross validation and the evaluation metric to use.
  • Spot checking a suite of linear and nonlinear machine learning algorithms.
  • Comparing the estimated accuracy of algorithms.

For this project, we will evaluate one linear, two non-linear and four ensemble algorithms:

Linear Algorithm: Logistic Regression

Non-Linear Algorithms: Decision Trees (CART) and k-Nearest Neighbors

Ensemble Algorithms: Bagged CART, Random Forest, Extra Trees, and Stochastic Gradient Boosting

The random number seed is reset before each run to ensure that every algorithm is evaluated using the same data splits. This ensures that the results are directly comparable.
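The claim about identical data splits is easy to verify: two KFold splitters constructed with the same seed generate exactly the same folds. (This sketch enables shuffle so that random_state takes effect; the fold counts here are arbitrary.)

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)

# Two splitters built with the same seed produce identical folds,
# so every algorithm sees the same train/test partitions.
kf1 = KFold(n_splits=5, shuffle=True, random_state=42)
kf2 = KFold(n_splits=5, shuffle=True, random_state=42)

splits1 = [test_idx.tolist() for _, test_idx in kf1.split(X)]
splits2 = [test_idx.tolist() for _, test_idx in kf2.split(X)]
print(splits1 == splits2)  # True
```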

4.a) Set test options and evaluation metric

In [39]:
# Run algorithms using 10-fold cross validation
num_folds = 10
scoring = 'accuracy'
In [40]:
# Set up Algorithms Spot-Checking Array
models = []
models.append(('LR', LogisticRegression(random_state=seedNum, n_jobs=cpu_num)))
models.append(('CART', DecisionTreeClassifier(random_state=seedNum)))
models.append(('KNN', KNeighborsClassifier(n_jobs=cpu_num)))
models.append(('BT', BaggingClassifier(random_state=seedNum, n_jobs=cpu_num)))
models.append(('RF', RandomForestClassifier(random_state=seedNum, n_jobs=cpu_num)))
models.append(('ET', ExtraTreesClassifier(random_state=seedNum, n_jobs=cpu_num)))
models.append(('GBM', GradientBoostingClassifier(random_state=seedNum)))
# models.append(('XGB', XGBClassifier(random_state=seedNum, n_jobs=cpu_num)))
results = []
names = []
metrics = []
In [41]:
# Generate model in turn
for name, model in models:
	email_notify("Algorithm "+name+" modeling has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
	startTimeModule = datetime.now()
	kfold = KFold(n_splits=num_folds, random_state=seedNum)
	cv_results = cross_val_score(model, x_train, y_train, cv=kfold, scoring=scoring)
	results.append(cv_results)
	names.append(name)
	metrics.append(cv_results.mean())
	msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
	print(msg)
	print ('Model training time:',(datetime.now() - startTimeModule))
	email_notify("Algorithm "+name+" modeling completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
print ('Average metrics ('+scoring+') from all models:',np.mean(metrics))
LR: 0.786502 (0.067795)
Model training time: 0:00:00.281299
CART: 0.820573 (0.057738)
Model training time: 0:00:00.544050
KNN: 0.727903 (0.054359)
Model training time: 0:00:00.071073
BT: 0.839478 (0.033653)
Model training time: 0:00:03.466121
RF: 0.865965 (0.053536)
Model training time: 0:00:00.324856
ET: 0.877358 (0.053533)
Model training time: 0:00:00.153678
GBM: 0.890493 (0.054459)
Model training time: 0:00:07.068690
Average metrics (accuracy) from all models: 0.8297532656023223

4.b) Spot-checking baseline algorithms

In [42]:
fig = pyplot.figure(figsize=(16,12))
fig.suptitle('Algorithm Comparison - Spot Checking')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.show()

5. Improve Accuracy

After we arrive at a short list of machine learning algorithms with a good level of accuracy, we can look for ways to improve their accuracy further.

5.a) Algorithm Tuning

In [43]:
# Set up the comparison array
results = []
names = []
In [44]:
# Tuning algorithm #1 - Extra Trees
email_notify("Algorithm #1 tuning has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
startTimeModule = datetime.now()
paramGrid1 = dict(n_estimators=np.array([50,100,200,300,400]))
model1 = ExtraTreesClassifier(random_state=seedNum, n_jobs=cpu_num)
kfold = KFold(n_splits=num_folds, random_state=seedNum)
grid1 = GridSearchCV(estimator=model1, param_grid=paramGrid1, scoring=scoring, cv=kfold)
grid_result1 = grid1.fit(x_train, y_train)

print("Best: %f using %s" % (grid_result1.best_score_, grid_result1.best_params_))
results.append(grid_result1.cv_results_['mean_test_score'])
names.append('ET')
means = grid_result1.cv_results_['mean_test_score']
stds = grid_result1.cv_results_['std_test_score']
params = grid_result1.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
print ('Model training time:',(datetime.now() - startTimeModule))
email_notify("Algorithm #1 tuning completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
Best: 0.880907 using {'n_estimators': 100}
0.875236 (0.044553) with: {'n_estimators': 50}
0.880907 (0.041987) with: {'n_estimators': 100}
0.869565 (0.035069) with: {'n_estimators': 200}
0.873346 (0.032565) with: {'n_estimators': 300}
0.877127 (0.031704) with: {'n_estimators': 400}
Model training time: 0:00:16.337793
In [45]:
# Tuning algorithm #2 - Stochastic Gradient Boosting
email_notify("Algorithm #2 tuning has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
startTimeModule = datetime.now()
paramGrid2 = dict(n_estimators=np.array([100,200,300,400,500]))
model2 = GradientBoostingClassifier(random_state=seedNum)
kfold = KFold(n_splits=num_folds, random_state=seedNum)
grid2 = GridSearchCV(estimator=model2, param_grid=paramGrid2, scoring=scoring, cv=kfold)
grid_result2 = grid2.fit(x_train, y_train)

print("Best: %f using %s" % (grid_result2.best_score_, grid_result2.best_params_))
results.append(grid_result2.cv_results_['mean_test_score'])
names.append('GBM')
means = grid_result2.cv_results_['mean_test_score']
stds = grid_result2.cv_results_['std_test_score']
params = grid_result2.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
print ('Model training time:',(datetime.now() - startTimeModule))
email_notify("Algorithm #2 tuning completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
Best: 0.901701 using {'n_estimators': 500}
0.890359 (0.054423) with: {'n_estimators': 100}
0.899811 (0.046891) with: {'n_estimators': 200}
0.899811 (0.042919) with: {'n_estimators': 300}
0.899811 (0.042919) with: {'n_estimators': 400}
0.901701 (0.041015) with: {'n_estimators': 500}
Model training time: 0:01:43.330914

5.b) Compare Algorithms After Tuning

In [46]:
fig = pyplot.figure(figsize=(16,12))
fig.suptitle('Algorithm Comparison - Post Tuning')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.show()

6. Finalize Model

Once we have narrowed the field down to a model that we believe can make accurate predictions on unseen data, we are ready to finalize it. Finalizing a model may involve sub-tasks such as:

  • Using the optimally tuned model to make predictions on unseen data.
  • Creating a standalone model using the tuned parameters.
  • Saving the optimal model to file for later use.
In [47]:
email_notify("Model Validation and Final Model Creation has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

6.a) Predictions on validation dataset

In [48]:
model = GradientBoostingClassifier(n_estimators=500, random_state=seedNum)
model.fit(x_train, y_train)
predictions = model.predict(x_test)
print(accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
0.8942731277533039
[[ 40  18]
 [  6 163]]
              precision    recall  f1-score   support

           0       0.87      0.69      0.77        58
           1       0.90      0.96      0.93       169

   micro avg       0.89      0.89      0.89       227
   macro avg       0.89      0.83      0.85       227
weighted avg       0.89      0.89      0.89       227
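As a sanity check, the headline numbers in the report above can be recomputed directly from the confusion matrix:

```python
import numpy as np

# Confusion matrix from the validation run above (rows = actual, cols = predicted)
cm = np.array([[40, 18],
               [6, 163]])

tn, fp, fn, tp = cm.ravel()
accuracy = (tn + tp) / cm.sum()
precision_1 = tp / (tp + fp)  # predicted-positive correctness for class 1
recall_1 = tp / (tp + fn)     # fraction of actual class-1 cases recovered
print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2))  # 0.8943 0.9 0.96
```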

6.b) Create standalone model on entire training dataset

In [49]:
startTimeModule = datetime.now()

# Combining the training and testing datasets to form the complete dataset that will be used for training the final model
x_complete = np.vstack((x_train, x_test))
y_complete = np.concatenate((y_train, y_test))
print("x_complete.shape: {} y_complete.shape: {}".format(x_complete.shape, y_complete.shape))

finalModel = GradientBoostingClassifier(n_estimators=500, random_state=seedNum)
finalModel.fit(x_complete, y_complete)
print ('Model training time:',(datetime.now() - startTimeModule))
x_complete.shape: (756, 168) y_complete.shape: (756,)
Model training time: 0:00:05.232414

6.c) Save model for later use

In [50]:
# modelName = 'finalModel_BinaryClass.sav'
# dump(finalModel, modelName)
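The commented dump() call above appears to come from joblib. Assuming that is the case, a minimal save-and-reload round trip looks like the sketch below; the toy model and temporary file path are stand-ins for finalModel and finalModel_BinaryClass.sav.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import GradientBoostingClassifier

# Fit a small stand-in model (the real script would persist finalModel)
rng = np.random.RandomState(0)
X = rng.rand(40, 4)
y = (X[:, 0] > 0.5).astype(int)
model = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)

# Save to disk, reload, and confirm the restored model predicts identically
path = os.path.join(tempfile.gettempdir(), 'finalModel_BinaryClass.sav')
dump(model, path)
restored = load(path)
print((restored.predict(X) == model.predict(X)).all())  # True
```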
In [51]:
email_notify("Model Validation and Final Model Creation completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
print ('Total time for the script:',(datetime.now() - startTimeScript))
Total time for the script: 0:11:33.461544